METHOD AND SYSTEM FOR SIMPLIFYING
THE STRUCTURE OF DYNAMIC EXECUTION PROFILES 

CROSS REFERENCE TO RELATED APPLICATIONS 

The present application is a continuation-in-part of U.S. Patent Application
No. 09/309,755, filed May 11, 1999, "Dynamic Software System Intrusion
Detection," which is hereby incorporated by reference in its entirety.



FIELD OF THE INVENTION 
The present invention generally relates to systems for dynamically detecting
intrusive or otherwise anomalous use of computer software, and more particularly to
software profiling techniques for analyzing the behavior of a computer program or
system.

BACKGROUND OF THE INVENTION

The present invention is particularly suited for, but by no means limited to, 
application in the field of computer system security. The background of the invention 
will therefore be described in the context of computer system security.

The literature and media abound with reports of successful violations of
computer system security by both external attackers and internal users. These
breaches occur through physical attacks, social engineering attacks, and attacks on the
system software. In a system software attack, the intruder subverts or bypasses the
security mechanisms of the system in order to gain unauthorized access to the system
or to increase current access privileges. These attacks are successful when the
attacker is able to cause the system software to execute in a manner that is typically
inconsistent with the software specification and thus leads to a breach in security.

Intrusion detection systems monitor some traces of user activity to determine 
if an intrusion has occurred. The traces of activity can be collated from audit trails or
logs, network monitoring, or a combination of both. Once the data regarding a
relevant aspect of the behavior of the system are collected, the classification stage
starts. Intrusion detection classification techniques can be broadly catalogued in
two main groups: misuse intrusion detection and anomaly intrusion detection. The
first type of classification technique searches for occurrences of known attacks with a
particular "signature," and the second type searches for a departure from normality.
Some of the newest intrusion detection tools incorporate both approaches. 

One known system for detecting an intrusion is the EMERALD™ program.
EMERALD defines the architecture of independent monitors that are distributed
about a network to detect intrusions. Each monitor performs a signature or profile
analysis of a "target event stream" to detect intrusions and communicates such
detection to other monitors on the system. The analysis is performed on event logs,
but the structure of the logs is not prescribed and the timeliness of the analysis and
detection of an intrusion depends on the analyzed system and how it chooses to
provide such log data. By monitoring these logs, EMERALD can thus determine that
at some point in the event stream recorded in the log, an intrusion occurred.
However, the detection is generally not implemented in real time, but instead occurs
at some interval of time after the intrusion. Also, this system does not allow
monitoring of all types of software activity, since it is limited to operating system
kernel events. It would be desirable to provide a real-time intrusion detection
paradigm that is applicable to monitoring almost any type of program.

A more general case of the intrusion detection problem is the problem of
anomalous behavior detection. It is possible to detect anomalous behavior based on
the measurement of program activity as control is passed among program control
structures. As a system executes its customary activities, the behavior monitoring
scheme should estimate a nominal system behavior. Departures from the nominal
system profile will likely represent potential anomalous activity on the system. Since
unwanted activity may be detected by comparison of the current system activity to
that occurring during previous assaults on the system, it would be desirable to store
profiles for recognizing these activities from historical data. Historical data, however,
cannot be used to recognize new kinds of behavior. An effective security tool would
be one designed to recognize assaults as they occur through the understanding and
comparison of the current behavior against nominal system activity. The subject
matter disclosed herein and in U.S. Patent Application No. 09/309,755 addresses
these issues.

A modern software system is composed of a plurality of subcomponents called
modules. Each program module generally implements a specific functional
requirement. As the program executes, each module may call another module.
Through a specific sequence of calls from one module to another, a higher level of
functional granularity is achieved. Many times there is a very tight binding between
the sequences of calls to program modules. That is, one module may always call 
another module. If this is the case, there is a high correlation between the activities of 
the two modules. They are bound so closely together that they interact as one 
functional unit. One goal of the present invention is to provide improved profiling 
techniques for use in analyzing interactions between and among program modules to 
identify anomalous behavior, e.g., in the field of computer security. 

At a finer level of granularity, each program module has internal structure as 
represented by a flowgraph. There are many potential execution paths through each 
program module. By instrumenting the paths associated with the decision points
within each module, not only can the observation of the interaction of program 
modules be made but the interactions within the program modules may be observed as 
well. 

SUMMARY OF THE INVENTION 

The present invention provides improvements in software profiling techniques 
for understanding and analyzing the behavior of a computer program. The invention 
provides a statistical methodology to identify a small working set of cohesive program 
instrumentation points that represent the dynamic bindings in the program structure as
it executes. More specifically, the present invention provides a way to reduce the 
dimensionality of the dynamic software profiling problem space, i.e., to identify 
distinct sources of variation in the interaction of program modules and to reduce the 
number of actual observations necessary to characterize the behavior of an executing 
software system. Other aspects of the present invention are described below. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The foregoing aspects and many of the attendant advantages of the present 
invention will become more readily appreciated as the same becomes better
understood by reference to the following detailed description, when taken in
conjunction with the accompanying drawings, wherein:

FIGURE 1 is a block diagram illustrating the operation of a mapping module 
with code instrumentation points. 

FIGURE 2 is a block diagram illustrating the operation of an execution profile
comparator.

FIGURE 3 is a block diagram illustrating an environment in which a preferred
embodiment of the present invention operates.

FIGURE 3A is a block diagram illustrating a plurality of modules in an
instrumented software system. This diagram is used in explaining a methodology in
accordance with the present invention to identify distinct sources of variation in the
interaction of program modules and to reduce the number of observations necessary to
characterize the behavior of an executing software system.

FIGURE 4 is an isometric view of a generally conventional computer system
suitable for implementing the present invention.

FIGURE 5 is a block diagram showing internal components of the computer
system of FIGURE 4.



DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS 

Overview

A system in accordance with the present invention employs dynamic software 
measurement to assist in the detection of intruders. Dynamic software measurement 
provides a framework to analyze the internal behavior of a system as it executes and
makes transitions among its various control structures governed by the structure of a
program flowgraph. A target system is instrumented so that measurements can be
obtained to profile the execution activity on the system in real time. Essentially, this 
approach measures from the inside of a software system to make inferences as to what 
is occurring outside of the program environment. 

Program modules are distinctly associated with certain functionalities that a
program is capable of performing. As each function is executed, it creates its own
distinct signature of activity. Since the nominal behavior of a system is more
completely understood while it is executing its customary activities, this nominal
system behavior can be profiled quite accurately. Departures from a nominal system
profile represent potential anomalous activity on the system. The system disclosed
herein is designed to recognize abnormal activity as it occurs through the
understanding and comparison of the current behavior against nominal system
activity.

Moreover, the present invention provides a method and system for reducing 
the amount of information necessary to understand the functional characteristics of an 
executing software system. The invention permits one to identify common sources of 
variation among program modules and to build execution profiles based on a reduced
set of virtual execution domains.

Illustrative embodiments of a system operating in accordance with the present 
invention include two modes of operation, although the present invention is by no 
means limited thereto. Each of these exemplary modes represents an incremental
improvement in the ability of the system to detect an anomalous activity of the system
or program executing thereon. However, as the resolving ability of the system
increases, there can be an associated penalty in processing overhead.



1. In a first mode, simple execution profiles indicative of modules that
have executed are employed for the evaluation of anomalous activity. For
example, an execution profile may be created by maintaining a module
sequence buffer including a list of all modules that have executed in a given
time frame and the frequencies of their executions. In one presently preferred
implementation of this mode, a profile transducer accumulates the module
frequencies until the number of module transitions that have occurred during
the sampling interval reaches a given value. This mode is the most coarse-grained
level of operation for the detection of anomalous behavior but, of the
two modes, it has the least cost in terms of performance.

2. In the second mode, the software is instrumented on all paths from
predicate nodes or decision points in each software module for the evaluation
of intrusive activity. For example, an execution profile may be created by
maintaining an execution path sequence buffer including a list of all decision
paths that have executed in a given time frame and the frequencies of their
executions. In one presently preferred implementation of this mode, a profile
transducer accumulates frequencies until the total number of module
transitions that have occurred during the sampling interval reaches a
predetermined value.

SOFT-0004 PATENT
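
By way of illustration only, the accumulation performed by a profile transducer in the first mode might be sketched as follows. The class name ProfileTransducer, the parameter epochs_per_profile, and the module names in the sample trace are hypothetical and are not taken from the present disclosure; the sketch merely assumes that a profile is emitted each time a fixed number of module transitions has been observed.

```python
from collections import Counter


class ProfileTransducer:
    """Accumulates module execution frequencies over a fixed sampling interval."""

    def __init__(self, epochs_per_profile):
        # Hypothetical tuning parameter: number of module transitions
        # observed before an execution profile is emitted.
        self.epochs_per_profile = epochs_per_profile
        self.counts = Counter()
        self.epochs = 0

    def record(self, module_id):
        """Record one module entry; return a finished profile, or None."""
        self.counts[module_id] += 1
        self.epochs += 1
        if self.epochs < self.epochs_per_profile:
            return None
        profile = dict(self.counts)
        # Start a fresh sampling interval.
        self.counts = Counter()
        self.epochs = 0
        return profile


# Illustrative trace of module entries (module names are made up).
trace = ["main", "parse", "parse", "eval", "main",
         "parse", "eval", "eval", "eval", "main"]
transducer = ProfileTransducer(epochs_per_profile=5)
profiles = [p for p in map(transducer.record, trace) if p is not None]
```

The same structure could serve the second mode if the identifiers passed to record() named decision paths rather than modules.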

In addition, the present invention may also be implemented in a mode 
whereby the amount of profile data used to detect anomalous activity is substantially 
reduced. A potential problem with using raw execution profiles, or raw execution 
vectors, is that the volume of data generated by a typical program can be very large, 
which can generate prohibitively large computational loads on the monitoring system. 
A methodology presented herein reduces the dimensionality of the problem from a 
very large set of program instrumentation points representing small execution 
domains (modules or execution paths) whose activity is highly correlated to a much 
smaller set of virtual program domains whose activity is substantially uncorrelated. 
To achieve this reduction in the dimensionality of the problem space, the statistical 
techniques of principal components analysis or principal factor analysis may be 
employed, although the present invention is not limited to these techniques. 
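
As a hedged illustration of this reduction, the following sketch extracts the leading principal component from a small set of execution profiles. The sample frequencies are invented, and power iteration is used here only as a simple stand-in for whatever numerical routine an actual embodiment would employ; neither is prescribed by the present disclosure. In the sample data the second module always executes twice as often as the first, so the two collapse onto a single virtual domain.

```python
import math


def covariance_matrix(profiles):
    """Sample covariance of module frequencies (rows: profiles, columns: modules)."""
    n, d = len(profiles), len(profiles[0])
    means = [sum(row[j] for row in profiles) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in profiles]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def leading_component(cov, iterations=200):
    """Power iteration: largest eigenvalue of cov and its unit eigenvector."""
    d = len(cov)
    # Start from the first coordinate axis (an arbitrary assumed choice).
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
    eigenvalue = sum(v[a] * cv[a] for a in range(d))
    return eigenvalue, v


# Invented frequencies for four modules over five sampling intervals.
profiles = [
    [10, 20, 1, 3],
    [12, 24, 2, 3],
    [8, 16, 1, 2],
    [15, 30, 3, 4],
    [5, 10, 0, 1],
]
cov = covariance_matrix(profiles)
eigenvalue, loadings = leading_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = eigenvalue / total_variance  # share of variance in one virtual domain
```

Here a single component accounts for well over nine tenths of the total variance, which suggests that one virtual execution domain could replace the four raw instrumentation points for this sample.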

Operating Environment 

FIGURE 1 illustrates the internal program environment of a program that has
been suitably instrumented for anomaly and/or intrusion detection. Each program will
have a plurality of program instrumentation points. Control is passed to a mapping
module 102 that records activity at that instrumentation point. The mapping module
transmits the telemetry information to an execution profile buffer 103 that buffers
these data until they are requested from the external program environment. All of
structures 101-103 are preferably encapsulated within the operating environment of a 
program to which the present invention is applied to detect anomalous behavior. 

FIGURE 2 shows the operation of an execution profile comparator 202. The 
execution profile comparator determines any difference (i.e., a differenced profile) 
between a current execution profile 201 most recently obtained from first profile 
transducer 202 and a nominal execution profile obtained from nominal profiles 
data 204, which represents the steady-state behavior of the software system with no 
abnormal activity. The nominal profiles data are initially established by a calibration 
process that is implemented by running the program in a calibration mode in which
the program is run through as many as possible of the functions and operations performed during
a nominal operational phase. A nominal activity profile and boundary conditions for
variations during this nominal operational phase are accumulated during this 
calibration mode. The nominal profile is subsequently modified by a user (or 
administrator), if during normal operation of the program an alarm is raised, but it is 
determined that no abnormal activity has occurred. In a typical software application, 
there may be a wide range of behavior that is considered nominal. The computational 
result of the comparison between the current execution profile and the steady state 
behavior of the system represented by the nominal profile suite is used in two 
different subsystems. The current profile is analyzed against the nominal profiles data 
204 and a difference value is constructed by the execution profile comparator 202. 
This difference value is transmitted to the policy engine 203 to determine what action 
will take place based on a preset policy. 
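
One possible reading of the calibration process is sketched below, purely as an assumption: the nominal profile is taken to be the mean proportion of activity in each module across the calibration runs, and the boundary conditions are taken to be the minimum and maximum proportions observed. The function names and sample counts are hypothetical, and the disclosure does not fix either statistic.

```python
def to_proportions(profile):
    """Convert module execution counts into proportions of total activity."""
    total = sum(profile.values())
    return {module: count / total for module, count in profile.items()}


def calibrate(calibration_profiles):
    """Return (nominal, bounds) built from profiles gathered in calibration mode."""
    modules = sorted({m for p in calibration_profiles for m in p})
    proportions = [to_proportions(p) for p in calibration_profiles]
    nominal, bounds = {}, {}
    for m in modules:
        values = [p.get(m, 0.0) for p in proportions]
        nominal[m] = sum(values) / len(values)    # assumed: mean proportion
        bounds[m] = (min(values), max(values))    # assumed: observed range
    return nominal, bounds


# Invented calibration runs; each maps a module name to an execution count.
calibration_runs = [
    {"main": 2, "parse": 5, "eval": 3},
    {"main": 2, "parse": 4, "eval": 4},
    {"main": 2, "parse": 6, "eval": 2},
]
nominal, bounds = calibrate(calibration_runs)
```

Adding a reviewed, non-threatening profile to calibration_runs and calling calibrate() again would correspond to the user-directed modification of the nominal profile described above.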

FIGURE 3 shows the relationship among the various components of an
anomaly detection system. A transducer 303 and an analysis engine 305 are 
important functional components of the anomaly detection methodology. The 
transducer obtains signals from an instrumented software system 301 and computes 
activity measures for these signals. The actual software signals may be obtained 
either from instrumented code (software probes) or directly from a hardware address 
bus (a hardware probe). The inputs to the transducer are software module entry and 
exit events that may be obtained either from software or hardware instrumentation. 

The transducer will produce, from the module probes, a summary profile of the
module execution frequencies. The transducer preferably interacts with the software
system in real-time. That is, all transition events are preferably available to the 
transducer as they occur. 

The output of transducer 303 is sent directly to analysis engine 305 for 
analysis. All sampling activity is measured in system epochs, or instrumentation 
transitions. 

Each execution profile 304 is obtained from transducer 303 by analysis engine
305. The comparator makes a formal assessment as to whether the current system
activity is nominal or otherwise. There are essentially two types of non-nominal
activity. The first occurs as new users or new programs are being run on the system.
This new activity is reviewed by an external decision maker, such as the system
administrator, and if determined to be non-threatening by that person, is added to the 
system repertoire of nominal behavior in a nominal profiles database 306. However, 
the observed activity may represent anomalous activity. 

The ability to recognize an anomalous event is dependent on the variance in 
the profiles of software activity. An incoming execution profile 304 is compared 
against calibration data for nominal system activity by the analysis engine. A 
difference value is created between this execution profile and the nominal behavior. 
This difference value 307 is sent to the policy engine 308. The policy engine 
compares the difference value against a predetermined threshold set by a system 
administrator at the system console 310. If the difference value is outside of a 
predetermined range the policy engine will transmit a signal 309 to the 
countermeasures interface in the instrumented software system to take action against 
the anomalous behavior. 
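
The comparison and policy steps just described might be sketched as follows. The Euclidean distance, the threshold value, the action labels, and the module names are all assumptions made for illustration; the disclosure speaks only of a difference value and a predetermined threshold.

```python
import math


def difference_value(current, nominal):
    """Assumed metric: Euclidean distance between two module-proportion profiles."""
    modules = set(current) | set(nominal)
    return math.sqrt(sum((current.get(m, 0.0) - nominal.get(m, 0.0)) ** 2
                         for m in modules))


def policy_engine(diff, threshold):
    """Choose an action for a difference value under a preset threshold."""
    return "countermeasure" if diff > threshold else "none"


# Invented profiles: proportions of execution events per module.
nominal = {"main": 0.2, "parse": 0.5, "eval": 0.3}
routine = {"main": 0.2, "parse": 0.45, "eval": 0.35}
suspect = {"main": 0.1, "parse": 0.1, "eval": 0.1, "exec_shell": 0.7}

threshold = 0.25  # hypothetical value set by an administrator
actions = [policy_engine(difference_value(p, nominal), threshold)
           for p in (routine, suspect)]
```

With these sample values the routine profile stays within the threshold, while the profile dominated by a module absent from the nominal data exceeds it and would trigger a signal to the countermeasures interface.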

To enhance the overall viability of the system to detect new and unobserved 
anomalous events, a management console 310 may optionally be added to the system. 
This management console receives a data stream from the policy engine and graphically
displays execution profile information (distances of departure from normality) in a
continuous recording strip on a display terminal (not separately shown). The moving
image of system activity is shown graphically by an emerging sequence of distances
along the X-axis of the display. Module (or functionality) frequency data are
displayed to render a visual image of overall system activity. The management
console provides a real-time image of system activity to a system
administrator. The management console also provides a convenient mechanism for
the system user to interrogate and receive information about specific system activity
and set security policy. This security policy is implemented by adjusting the 
sensitivity of the system 312 by changing the thresholds for recognition of anomalous 
behavior. By clicking on any of the distances the user will receive information as to 
the program activity associated with each distance measure.

The Profile-Oriented Anomaly Detection Model

As any software system is being specified, designed and developed, it is 
constructed to perform a distinct set of mutually exclusive operations O for the 
customer. An example of such an operation might be the activity of adding a new 
user to a computer system. At the software level, these operations may be reduced to 
a well-defined set of functions F. These functions represent the decomposition of
operations into sub-problems that may be implemented on computer systems. The
operation of adding a new user to the system might involve the functional activities of
changing from the current directory to a password file, updating the password file,
establishing user authorizations, and creating a new file for the new user. During the
software design process, the basic functions are mapped by system designers to
specific software program modules that implement the functionality.

From the standpoint of computer security, not all operations are equal. Some 
user operations may have little or no impact on computer security considerations. 
Other operations, such as system maintenance activities, have a much greater impact 
on security. System maintenance activities being performed by system administrators 
would be considered nominal system behavior. System maintenance activities being 
performed by dial-up users, on the other hand, would not be considered nominal 
system behavior. In order to implement this decomposition process, a formal 
description of these relationships must be established. To assist in the subsequent 
discussion of program functionality, it will be useful to make this description 
somewhat more precise by introducing some notation conveniences. 

Formal Description of Software Operation 

Assume that a software system S was designed to implement a specific set of
mutually exclusive functionalities F. Thus, if the system is executing a function
$f \in F$, then it is not expressing elements of any other functionality in F. Each of
these functions in F was designed to implement a set of software specifications based
on a user's requirements. From a user's perspective, this software system will
implement a specific set of operations O. This mapping from the set of user perceived
operations O to a set of specific program functionalities is one of the major functions
in the software specification process. It is possible, then, to define a relation
IMPLEMENTS over $O \times F$ such that IMPLEMENTS(o, f) is true if functionality f is
used in the specification of an operation o.

From a computer security standpoint, operations can be envisioned as the set 
of services available to a user (e.g., login, open a file, write to a device), and 
functionalities as the set of internal operations that implement a particular operation 
(e.g., user-id validation, access control list (ACL) lookup, labeling). When viewed 
from this perspective, it is apparent that user operations, which may appear to be non- 
security relevant, may actually be implemented with security relevant functionalities 
(send mail is a classic example of this: an inoffensive operation of send mail can be
transformed into an attack if the functionalities that deal with buffers are thereby
overloaded).

The software design process is strictly a matter of assigning functionalities in
F to specific program modules $m \in M$, the set of program modules of system S. The
design process may be thought of as the process of defining a set of relations
ASSIGNS over $F \times M$, such that ASSIGNS(f, m) is true if functionality f is expressed
in module m. For a given software system S, let M denote the set of all program
modules for that system. For each function $f \in F$, there is a relation p over $F \times M$
such that p(f, m) is the proportion of execution events of module m when the system
is executing function f.
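
For concreteness, the relations IMPLEMENTS, ASSIGNS, and p may be pictured as small tables, as in the following sketch. The operation, functionality, and module names and the proportions are invented for illustration only, and the helper modules_for_operation is hypothetical.

```python
# IMPLEMENTS over O x F: pairs (operation, functionality). Names are invented.
IMPLEMENTS = {
    ("add_user", "update_password_file"),
    ("add_user", "set_authorizations"),
    ("login", "validate_user_id"),
}

# ASSIGNS over F x M: pairs (functionality, module).
ASSIGNS = {
    ("update_password_file", "pwd_io"),
    ("set_authorizations", "acl"),
    ("validate_user_id", "pwd_io"),
    ("validate_user_id", "auth"),
}

# p(f, m): proportion of execution events of module m while executing f.
p = {
    ("update_password_file", "pwd_io"): 1.0,
    ("set_authorizations", "acl"): 1.0,
    ("validate_user_id", "pwd_io"): 0.25,
    ("validate_user_id", "auth"): 0.75,
}


def modules_for_operation(operation):
    """Modules that may receive control while the given operation is expressed."""
    functionalities = {f for (o, f) in IMPLEMENTS if o == operation}
    return {m for (f, m) in ASSIGNS if f in functionalities}


add_user_modules = modules_for_operation("add_user")
login_modules = modules_for_operation("login")
```

Note that the module pwd_io is shared by both operations in this sample, which is why the proportions p(f, m), and not mere module membership, carry the information needed to tell the operations apart.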

Each operation that a system may perform for a user may be thought of as 
having been implemented in a set of functional specifications. There may be a one-to-one
mapping between the user's notion of an operation and a program function. In
most cases, however, there may be several discrete functions that must be executed to
express the user's concept of an operation. For each operation o that the system may
perform, the range of functionalities f must be well known. Within each operation,
one or more of the system's functionalities will be expressed. 

A finer level of measurement granularity may be achieved by instrumenting 
each decision path within each program module. In this case, each branch on a 
decision path from a predicate node will be monitored with suitable instrumentation.

The Software Epoch

When a program is executing a functionality, it apportions its activities among 
a set of modules. As such, it transitions from one module to the next on a call (or 
return) sequence. Each module called in this call sequence will have an associated
call frequency. When the software is subjected to a series of unique and distinct
functional expressions, there is a different behavior for each of the user's operations,
in that each will implement a different set of functions, which, in turn, invoke
possibly different sets of program modules. 

An epoch begins with the encounter of an instrumentation point in the code. 
The current epoch ends when a new instrumentation point is encountered. Each 
encounter with an execution point represents an incremental change in the epoch 
number. Computer programs executing in their normal mode make state transitions 
between epochs rather rapidly. In terms of real clock time, many epochs may elapse in 
a relatively short period. 

Formal Definitions of Profiles 

It can be seen that there is a distinct relationship between any given operation 
o and a given set of program modules. That is, if the user performs a particular 
operation, then this operation manifests itself in certain modules receiving control. It
is also possible to detect, inversely, which program operations are being executed by 
observing the pattern of modules executing, i.e., the module profile. In a sense, the 
mapping of operations to modules and the mapping of modules to operations is 
reflexive. 

When the process of software requirements specification is complete, a system 
consisting of a set of mutually exclusive operations will have been defined. 
Characteristically, each user of a new system will cause each operation to be 
performed at a potentially different rate than another user. Each user, then, will induce 
a probability distribution on the set O of mutually exclusive operations. This 
probability function is a multinomial distribution and constitutes the operational 
profile for that user.

The operational profile of a software system is the set of unconditional
probabilities of each of the operations in O being executed by the user. Then,
$\Pr[X = k]$, $k = 1, 2, \ldots, \|O\|$, is the probability that the user is executing program operation k
as specified in the functional requirements of the program, and $\|O\|$ is the cardinality of
the set of operations. A program executing on a serial machine can only be executing
one operation at a time. The distribution of the operational profile is thus multinomial
for programs designed to fulfill more than two distinct operations.

A user performing the various operations on a system causes each operation to 
occur in a series of steps or transitions. The transition from one operation to another 
may be described as a stochastic process. In this case, an indexed collection of 

random variables $\{X_t\}$ may be defined, where the index t runs through a set of
non-negative integers t = 0, 1, 2, ... representing the individual transitions or intervals of
the process. At any particular interval, the user is found to be expressing exactly one
of the system's a operations. The fact of the execution occurring in a particular
operation is a state of the user. During any interval, the user is found performing
exactly one of a finite number of mutually exclusive and exhaustive states that may be
labeled 0, 1, 2, ..., a. In this representation of the system, there is a stochastic process
$\{X_t\}$, where the random variables are observed at intervals t = 0, 1, 2, ... and where
each random variable may take on any one of the (a + 1) integers, from the state
space $O = \{0, 1, 2, \ldots, a\}$.

Each user may potentially bring his/her own distinct behavior to the system. 
Thus, each user will have a unique characteristic operational profile. It is a
characteristic, then, of each user to induce a probability function $p_i = \Pr[X = i]$ on
the set of operations O. In that these operations are mutually exclusive, the induced 
probability function is a multinomial distribution. 

As a software system is specified and designed, the user requirements
specifications, the set O, must be mapped onto a specific set of functionalities F by
systems designers. The set F represents the design specifications for the system. As
per the earlier discussion, each operation is implemented by one or more
functionalities. The transition from one functionality to another may be also
described as a stochastic process. Thus, a new indexed collection of random variables
$\{Y_t\}$ may be defined, representing the individual transition events among particular
functionalities. At any particular interval, a given operation expresses exactly one of
the system's b + 1 functionalities. During any interval, the user is found performing
exactly one of a finite number of mutually exclusive and exhaustive states that may be
labeled 0, 1, 2, ..., b. In this representation of the system, there is a stochastic process
$\{Y_t\}$, where the random variables are observed at intervals t = 0, 1, 2, ... and where
each random variable may take on any one of the (b + 1) integers, from the state space
$F = \{0, 1, 2, \ldots, b\}$.

When a program is executing a given operation, say $o_k$, it will distribute its
activity across the set of functionalities $F^{(o_k)}$. At any arbitrary interval n during the
expression of $o_k$, the program will be executing a functionality $f_i \in F^{(o_k)}$ with a
probability $\Pr[Y_n = i \mid X = k]$. From this conditional probability distribution for all
operations, the functional profile for the design specifications may be derived as a
function of a user operational profile, as:

$$\Pr[Y = i] = \sum_j \Pr[X = j] \Pr[Y = i \mid X = j].$$

Alternatively:

$$w_i = \sum_j p_j \Pr[Y = i \mid X = j].$$

The next logical step is to study the behavior of a software system at the
module level. Each of the functionalities is implemented in one or more program
modules. The transition from one module to another may be also described as a
stochastic process. Therefore, a third indexed collection of random variables $\{Z_t\}$
may be defined, as before, representing the individual transition events among the set
of program modules. At any particular interval, a given functionality is found to be
executing exactly one of the system's c modules. The fact of the execution occurring
in a particular module is a state of the system. During any interval, the system is
found executing exactly one of a finite number of mutually exclusive and exhaustive
states (program modules) that may be labeled 0, 1, 2, ..., c. In this representation of
the system, there is a stochastic process $\{Z_t\}$, where the random variables are
observed at epochs t = 0, 1, 2, ... and where each random variable may take on any
one of the (c + 1) integers, from the state space $M = \{0, 1, 2, \ldots, c\}$.

Each functionality j has a distinct set of modules $M^{(f_j)}$ that it may cause to
execute. At any arbitrary interval n during the expression of $f_j$, the program will be
executing a module $m_i \in M^{(f_j)}$ with a probability $\Pr[Z_n = i \mid Y = j]$. From this
conditional probability distribution for all functionalities, the module profile for the
system may be derived as a function of the system functional profile as follows:

$$\Pr[Z = i] = \sum_j \Pr[Y = j] \Pr[Z = i \mid Y = j].$$

Again,

$$r_i = \sum_j w_j \Pr[Z = i \mid Y = j].$$

The module profile r ultimately depends on the operational profile, as can be
seen by substituting for $w_j$ in the equation above:

$$r_i = \sum_j \sum_k p_k \Pr[Y = j \mid X = k] \Pr[Z = i \mid Y = j].$$
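
A small worked example may make the chain from operational profile to module profile concrete. In the sketch below the operational profile p and both conditional probability tables are invented values, and the function names are hypothetical; only the two summations follow the relations given above.

```python
def functional_profile(p, y_given_x):
    """w_i = sum over j of p_j * Pr[Y = i | X = j]."""
    n_functionalities = len(y_given_x[0])
    return [sum(p[j] * y_given_x[j][i] for j in range(len(p)))
            for i in range(n_functionalities)]


def module_profile(w, z_given_y):
    """r_i = sum over j of w_j * Pr[Z = i | Y = j]."""
    n_modules = len(z_given_y[0])
    return [sum(w[j] * z_given_y[j][i] for j in range(len(w)))
            for i in range(n_modules)]


# Invented tables: row j holds the conditional distribution given state j.
y_given_x = [[0.5, 0.5, 0.0],   # Pr[Y = i | X = 0]
             [0.0, 0.5, 0.5]]   # Pr[Y = i | X = 1]
z_given_y = [[1.0, 0.0],        # Pr[Z = i | Y = 0]
             [0.5, 0.5],        # Pr[Z = i | Y = 1]
             [0.0, 1.0]]        # Pr[Z = i | Y = 2]

p = [0.75, 0.25]                        # one user's operational profile
w = functional_profile(p, y_given_x)    # [0.375, 0.5, 0.125]
r = module_profile(w, z_given_y)        # [0.625, 0.375]

p_other = [0.25, 0.75]                  # a user who favors the other operation
r_other = module_profile(functional_profile(p_other, y_given_x), z_given_y)
# r_other is [0.375, 0.625]
```

The two users exercise the same software yet produce different module profiles, which is the property relied upon for detection.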

Each distinct operational scenario creates its own distinct module profile. It is 
this fact that may be exploited in the detection of unwanted or intrusive events. 

In summary, when a user is exercising a system, the software will be driven
through a sequence of transitions from one instrumentation point to the next,
$S = (m_{ab}, m_{bc}, m_{cd}, \ldots)$, where $m_{ab}$ represents a transition from instrumentation point a
to instrumentation point b. Over a fixed number of epochs, each progressive
sequence of n observations will represent a new execution profile. It represents a
sample drawn from a pool of nominal system behavior. Thus, the series of sequences
$S = (S_i, S_{i+1}, S_{i+2}, \ldots)$ above will generate a family of execution profiles
$(p_i, p_{i+1}, p_{i+2}, \ldots)$. What becomes clear after a period of observation is that the range
of behavior exhibited by a system and expressed in sequences of execution profiles is 
highly constrained. There are certain standard behaviors that are demonstrated by the 
system during its normal operation. Abnormal system activities will create significant 
disturbances in the nominal system behavior. 
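
The idea of a family of execution profiles can be illustrated with a short Python sketch. It assumes, purely for illustration, that each observation is the index of the module seen at an epoch and that "progressive" sequences are overlapping windows advanced by one observation; the function name and the sample data are hypothetical.

```python
from collections import Counter


def execution_profiles(observations, n, num_modules):
    """Return one relative-frequency profile per window of n observations."""
    profiles = []
    for start in range(len(observations) - n + 1):
        counts = Counter(observations[start:start + n])
        # Proportion of the window spent in each module.
        profiles.append([counts[m] / n for m in range(num_modules)])
    return profiles


# Hypothetical observation stream over 3 modules.
obs = [0, 1, 1, 2, 1, 0]
profiles = execution_profiles(obs, n=4, num_modules=3)
```

Under nominal behavior the successive profiles stay within a narrow range; a profile far from that range is a candidate anomaly.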

The whole notion of anomaly detection would be greatly facilitated if the 
functionalities of the system were known and well defined. It would also be very 
convenient if there were a precise description of the set of operations for the software. 
Indeed, if these elements of software behavior were known and precisely specified,
the likelihood of vulnerabilities in the system would diminish greatly. In the absence
of these specifications, it will be assumed that neither the operational profiles nor the
functional profiles can be observed directly. Instead, the distribution of activity
among the program modules must be observed in order to make inferences about the
behavior of the system.

The presently preferred implementations of the present invention, then,
represent a new real-time approach to detect aberrant modes of system behavior
induced by abnormal and unauthorized system activities. Within the category of
aberrant behavior, there are two distinct types of problems. First, a user may be
executing a legitimate system operation (technically an operation $o \in O_E$) for which
he is not authorized. Second, a user (hacker) could be exploiting his knowledge of the
implicit operations of the system by executing an operation $o \in O_I$. Each of these two
different types of activities will create a distinct and observable change in the state of
the software system and its nominal activity. Violations in the security of the system
will result in behavior that is outside the normal activity of the system and thus result
in a perturbation in the normal profiles. These violations are detected by the analysis
of the profiles generated from an instrumented software system against a database of
established normal behavior.

It is important to note that the present invention is broadly applicable to almost
any type of software and can monitor activity occurring in any application or
operating system to detect anomalies indicative of intrusive behavior. Prior art
intrusion detection systems generally monitor events from the outside in and thus can
overlook an intrusion because they do not respond to variations in software internal
activity that is not logged. In contrast, a system in accordance with the present
invention can operate in real time, from within the application being monitored, and is
able to respond to subtle changes that occur as a result of anomalous system behavior.
Furthermore, since the inventive system can be applied to any type of computational
activity, it can be used to monitor almost any type of software system and detect
intrusions, for example, in software running a web server, or in database management
systems, or operating system shells, or file management systems. Any software that
may be impacted by deliberate misuse may be instrumented and monitored to detect such
misuse or intrusion.

The Flowgraph Representation of a Program

The internal structure of a program module can also be instrumented to
enhance the detectability of anomalous program activity. The internal structure of a
program module will be modeled in terms of its control flowgraph. A control
flowgraph of a program module is constructed from a directed graph representation of
the program module that can be defined as follows:

• A directed graph, G = (N, E, s, t), consists of a set of nodes, N, a set of
edges, E, a distinguished node s, the start node, and a distinguished node t,
the exit node. An edge is an ordered pair of nodes, (a, b).

• The in-degree I(a) of node a is the number of entering edges to a.

• The out-degree O(a) of node a is the number of exiting edges from a.

The flowgraph representation of a program, F = (E', N', s, t), is a directed graph that
satisfies the following properties.

• There is a unique start node s such that I(s) = 0.

• There is a unique exit node t such that O(t) = 0.

All other nodes are members of exactly one of the following three categories.

• Processing Node: it has one entering edge and one exiting edge. For a
processing node a, I(a) = 1 and O(a) = 1.

• Predicate Node: it represents a decision point in the program as a result of
if statements, case statements, or any other statement that will cause an
alteration in the control flow. For a predicate node a, I(a) = 1 and O(a) > 1.

• Receiving Node: it represents a point in the program where two or more
control flows join, for example, at the end of a while loop. For a receiving
node a, I(a) > 1 and O(a) = 1.

If (a, b) is an edge from node a to node b, then node a is an immediate
predecessor of node b and node b is an immediate successor of node a. The set of all
immediate predecessors for node a is denoted as IP(a). The set of all immediate
successors for node b is denoted as IS(b). No node may have itself as a successor.
That is, a may not be a member of IS(a). In addition, no processing node may have a
processing node as a successor node. All successor nodes to a processing node must
be either predicate nodes or receiving nodes. Similarly, no processing node may have
a processing node as its predecessor.

A path P in a flowgraph F is a sequence of edges $\langle a_1 a_2, a_2 a_3, \ldots, a_{N-1} a_N \rangle$
where all $a_i$ (i = 1, ..., N) are elements of N'. P is a path from node $a_1$ to node $a_N$. An
execution path in F is any path P from s to t.
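
The node categories defined above can be checked mechanically from the in-degree and out-degree of each node. The following Python sketch assumes a flowgraph given as a list of (a, b) edge pairs; the function name and the sample graph (a single if/else construct) are hypothetical, and the sketch does not check the additional rule that processing nodes may not be adjacent to one another.

```python
def classify_nodes(edges, start, exit_node):
    """Classify every interior node as processing, predicate, or receiving."""
    nodes = {n for edge in edges for n in edge}
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for a, b in edges:
        if a == b:
            raise ValueError(f"node {a!r} may not be its own successor")
        out_deg[a] += 1
        in_deg[b] += 1
    if in_deg[start] != 0 or out_deg[exit_node] != 0:
        raise ValueError("start needs I(s) = 0 and exit needs O(t) = 0")
    kinds = {}
    for n in nodes - {start, exit_node}:
        i, o = in_deg[n], out_deg[n]
        if i == 1 and o == 1:
            kinds[n] = "processing"
        elif i == 1 and o > 1:
            kinds[n] = "predicate"
        elif i > 1 and o == 1:
            kinds[n] = "receiving"
        else:
            raise ValueError(f"node {n!r} has I={i}, O={o}; fits no category")
    return kinds


# Hypothetical flowgraph for one if/else: s -> a, a -> {b, c}, {b, c} -> d, d -> t.
edges = [("s", "a"), ("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "t")]
kinds = classify_nodes(edges, start="s", exit_node="t")
```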
Instrumentation Points 

As the software executes there are specific points in the execution path that 
can be monitored either by hardware or software probes. At the highest level of 
granularity, the structure of the operation of the software system can be analyzed 
through instrumentation placed at the beginning of each program module. The next 
level of instrumentation would also insert instrumentation points prior to the return 
from each program module. 

The lowest level of granularity of instrumentation is achieved by inserting 
instrumentation points at selected points in a program module based on its flowgraph 
representation. At this level, probes are inserted at every edge exiting a predicate 
node. That is, if any node, a, in the flowgraph representing the program module has
multiple successor nodes, $b_1, b_2, \ldots, b_n$, then probes will be inserted into the edges
represented by $ab_1, ab_2, \ldots, ab_n$.
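
A minimal sketch of this finest level of probe placement follows. It assumes the flowgraph is available as a list of (a, b) edge pairs and simply returns every edge that leaves a node with more than one successor; the function name and the sample graph are hypothetical.

```python
def probe_edges(edges):
    """Return the edges that exit any node having multiple successors."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, []).append(b)
    # A node with more than one successor is a predicate node; each of its
    # exiting edges receives a probe.
    return [(a, b)
            for a, succ in successors.items() if len(succ) > 1
            for b in succ]


# Hypothetical flowgraph for one if/else construct.
edges = [("s", "a"), ("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "t")]
probes = probe_edges(edges)
```

Only the two branches leaving the predicate node a are instrumented here; straight-line edges carry no additional information about which path was taken.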

Simplifying the Dynamic Execution Profiles 

We will now turn to an inventive methodology to reduce the dimensionality of
the dynamic software profiling problem space. As discussed above, modern software
systems are composed of a plurality of modules in an instrumented software system
321 as shown in FIGURE 3A, and each program module generally implements a
specific functional requirement. A methodology in accordance with this aspect of the
present invention may be utilized to identify distinct sources of variation in the
interaction of program modules and substantially reduce the number of actual
observations necessary to characterize the behavior of an executing software system.
A mapping vector 322 is constructed that will send observations from each software
instrumentation point to a single virtual execution domain 323. This will serve to
reduce the dimensionality of the problem space by at least two orders of magnitude.

As the program executes, each module may call another module. Through a
specific sequence of calls from one module to another, a higher level of functional
granularity is achieved. Moreover, there may be a very tight binding between the
sequences of calls to program modules - that is, one module may always call another
module. If this is the case, there is a high correlation between the activities of the two
modules. They are bound so closely together that they interact as one functional unit.
The present invention provides a statistical methodology to identify a small working
set of cohesive program modules that represent the dynamic bindings among program
modules as they execute.

As a program executes, it will encounter the instrumentation points that
monitor its activity. Each of these encounters will define one epoch in the execution
of the program.

In the course of normal program activity, the telemetry from the
instrumentation points at each epoch is recorded in an execution profile for the
program. This execution profile is an n element vector, X, containing one entry for
each program module. Each element, $x_i$, of this vector, X, will contain a frequency
count for the number of times that the corresponding instrumentation point has
executed during an era of k epochs. Thus, $k = \sum_{i=1}^{n} x_i$.

One of the major problems in monitoring the execution behavior of a modern
software system is that the value of k will become very large in a very short period on
a real time clock reference framework. To be meaningful at all, these data must be
captured periodically and recorded for later analysis. In this context, an execution
profile might be recorded whenever the value of k reaches a total count of, say, K, at
which time the contents of the original execution profile vector would be reset to zero,
i.e., $x_i = 0, \forall i = 1, 2, \ldots, n$. The recorded activity of a program during its last L = jK
epochs will be stored in a sequence of j execution profiles, $X_1, X_2, \ldots, X_j$. Thus, the
value $x_{i,j}$ will represent the frequency of execution of the j-th program module on the
i-th execution profile.

The behavior of a program is embodied in the manner with which the program
control is transferred in the call graph or module flowgraph representation while the
program is running. This interaction between any two program instrumentation points
$m_a$ and $m_b$ during the execution of a program over L epochs, or j execution profiles,
may be expressed in terms of their covariance,

$$\sigma_{ab} = \frac{1}{j-1} \sum_{i=1}^{j} (x_{i,a} - \bar{x}_a)(x_{i,b} - \bar{x}_b),$$

where $\bar{x}_a$ and $\bar{x}_b$ are the mean frequency counts of the two points over the j profiles.
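
The recording scheme and the covariance between two instrumentation points can be sketched together in Python. The function names, the event stream, and the choice of the sample covariance with divisor (number of profiles minus one) are assumptions made for illustration only.

```python
def record_profiles(events, n, K):
    """Split a stream of instrumentation-point hits into profiles of K epochs."""
    profiles = []
    x = [0] * n          # current execution profile
    k = 0                # epochs counted in the current era
    for point in events:
        x[point] += 1
        k += 1
        if k == K:       # era complete: record the profile and reset
            profiles.append(x)
            x = [0] * n
            k = 0
    return profiles


def covariance(profiles, a, b):
    """Sample covariance of the counts of points a and b across profiles."""
    count = len(profiles)
    mean_a = sum(p[a] for p in profiles) / count
    mean_b = sum(p[b] for p in profiles) / count
    return sum((p[a] - mean_a) * (p[b] - mean_b) for p in profiles) / (count - 1)


# Hypothetical stream over n = 3 instrumentation points, recorded every K = 4 epochs.
events = [0, 1, 0, 1, 0, 1, 2, 2, 2, 2, 2, 2]
profiles = record_profiles(events, n=3, K=4)
cov_01 = covariance(profiles, 0, 1)
cov_02 = covariance(profiles, 0, 2)
```

In this toy stream points 0 and 1 always fire together, so their covariance is positive, while point 2 is active only when the others are not, giving a negative covariance.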

The principal role of the behavioral analysis of program execution will be in 
the area of application monitoring. In a computer security context, changes in the 
correlation of the modules over time from some nominal pattern will indicate the
presence of new behavior that is potentially threatening. From an availability 
perspective, these same changes in program behavior may well mean that the program 
is now executing in a new and uncertified manner. 

A potential problem with using raw execution vectors is that the volume of 
data generated by a typical program while it is executing is very large. Analysis of 
these data in real time can generate prohibitively large computational loads on the 
monitoring system. An objective of the methodology presented here is to reduce the 
dimensionality of the problem from a very large set of n instrumentation points whose
activity is highly correlated to a much smaller set of m virtual instrumentation points 
whose activity is orthogonal or uncorrelated. 

To achieve this reduction in the dimensionality of the problem space, the 
statistical techniques of principal components analysis or principal factor analysis 
may be employed, although the present invention is not limited to these techniques.
The goal of principal components analysis is to construct linear combinations of the 
original observations in the data matrix that account for a large part of the total 
variation. Factor analysis, on the other hand, is concerned with the identification of 
structure within a data matrix. In principal components analysis the structure of the 
virtual system is discovered. In factor analysis the structure of the virtual system is 
known a priori and the variance among the factor structure is apportioned to the 
variables in the data matrix. 

For either of these two techniques, the n x j, j > n, data matrix
$D = X_1, X_2, \ldots, X_j$ will be factored into m virtual orthogonal module components.
Associated with each of the new m orthogonal components will be an eigenvalue
$\lambda_i$, where $\sum_{i=1}^{n} \lambda_i = n$. The number of components extracted in the new orthogonal
structure will be determined by a stopping rule based on the eigenvalues. Examples of
two possible stopping rules would be

1) extract all components whose eigenvalues are greater than some preset
threshold, say 1.0, or

2) extract those components such that the proportion of variation represented
by $V = \frac{1}{n} \sum_{i=1}^{m} \lambda_i$ is at least equal to a preset value such as 0.9 or better.
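
The two stopping rules can be sketched as follows, assuming the eigenvalues have already been obtained from some eigen-decomposition routine. The function names and the eigenvalues below are hypothetical; the second function uses the sum of the eigenvalues as the total variation, which equals n under the condition stated above.

```python
def components_by_threshold(eigenvalues, threshold=1.0):
    """Rule 1: count the components whose eigenvalue exceeds the threshold."""
    return sum(1 for lam in eigenvalues if lam > threshold)


def components_by_proportion(eigenvalues, target=0.9):
    """Rule 2: smallest m whose largest m eigenvalues explain >= target."""
    total = sum(eigenvalues)
    running = 0.0
    ordered = sorted(eigenvalues, reverse=True)
    for m, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= target:
            return m
    return len(ordered)


# Hypothetical eigenvalues for n = 6 instrumentation points (they sum to 6).
eigenvalues = [3.0, 1.75, 0.5, 0.25, 0.25, 0.25]
m_rule1 = components_by_threshold(eigenvalues)
m_rule2 = components_by_proportion(eigenvalues)
```

The two rules need not agree: for these values the threshold rule retains fewer components than the 0.9 proportion rule.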

A product of the principal component or factor analysis solution to the
factoring of the data matrix, D, is the factor pattern or principal components structure,
P. These two statistical techniques are similar. The principal components analysis
technique seeks to explain the sources of variation in a set of correlated variables.
The factor analytic technique seeks to explain the correlations among the variables.
The PCA technique will produce a unique solution whereas the factor analytic model
is not unique. In the principal components model the principal components are
uncorrelated unconditionally whereas in the factor analytic solution the factors are
uncorrelated only within the common factor space. The matrix, P, is an n x m
structure whose rows, $p_{j \cdot}$, contain values showing the degree of relationship of the
variation of the j-th program module and the k-th factor or principal component. Let
$q_j = \max_{1 \le k \le m} p_{j,k}$. Let $o_j = \mathrm{index}(q_j)$ represent the column number in which the
corresponding value $q_j$ occurs. If, for row j, the largest value of $p_{j,k}$ occurs in column
5, then $o_j = 5$. This indicates that the program module, $m_j$, is most clearly related to
the fifth factor or principal component.

The vector, O, whose elements are the index values, $o_j$, is defined to be the
mapping vector for the execution profile vector. This mapping vector will be
employed to send the probe event frequencies recorded in the execution profile vector
onto their corresponding virtual module equivalents. That is, the frequency count for
the instrumentation point k in an execution profile vector will be represented by $f_k$.
The mapping vector element $o_k$ will contain the index value for the principal
component (factor), say l, that k maps into.
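
A minimal Python sketch of the mapping vector and its use follows. It takes a factor pattern P as a list of rows (one per instrumentation point), builds O from the largest entry in each row, and then folds a raw execution profile into a virtual profile by summing the counts of all points mapped to the same component. The function names and all numeric values are hypothetical.

```python
def mapping_vector(P):
    """o[j] = 1-based column of the largest value in row j of the pattern P."""
    return [max(range(len(row)), key=lambda k: row[k]) + 1 for row in P]


def virtual_profile(x, o, m):
    """Fold raw counts x into m virtual counts using mapping vector o."""
    y = [0] * m
    for count, component in zip(x, o):
        y[component - 1] += count   # o holds 1-based component numbers
    return y


# Hypothetical pattern: 4 instrumentation points, 2 components.
P = [[0.90, 0.10],
     [0.80, 0.30],
     [0.20, 0.70],
     [0.10, 0.95]]
o = mapping_vector(P)
x = [5, 3, 2, 10]                   # one raw execution profile
y = virtual_profile(x, o, m=2)
```

The total count is preserved by the mapping; only the number of dimensions changes, here from four raw points to two virtual ones.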

In essence, the principal components or factor analysis techniques will serve to
identify m distinct and orthogonal sources of variation in the data matrix D
representing the original n program instrumentation points. These new m orthogonal
domains will represent the actual manner in which the software system is executing.
Whereas the raw measurements taken on the non-reduced software system reflect n
distinct instrumentation points, these points are actually interacting as m distinct units,
m < n.

On each of the original raw profiles, the instrumentation point frequency count
was represented in the elements, $x_{i,j}$, of the profile vector, $X_i$. After the mapping
vector has been established to map each of the execution interactions as reflected by
the individual instrumentation points into a new virtual execution domain, the virtual
profile vector, $Y_i$, will be employed to contain the frequency counts for any of the
interactions among the virtual execution domain sets. Thus,

$$y_{i,k} = \sum_{j=1}^{n} f(x_{i,j}), \quad \text{where} \quad f(x_{i,j}) = \begin{cases} 0 & \text{if } o_j \neq k \\ x_{i,j} & \text{otherwise.} \end{cases}$$

Each of the vectors $X_i$ represents a point in an n dimensional space. Each of
the vectors $Y_i$ represents a point in a substantially reduced m dimensional space.
Each point in the n dimensional raw vector represents a behavioral observation of the
software system during K epochs. The frequency counts of the individual
instrumentation points are, however, correlated. Each of the points in a reduced
virtual execution domain similarly represents a behavioral observation also during the
same K epochs. The frequency counts of each of these points are, however, not
correlated.

Computer System Suitable for Implementing the Present Invention

With reference to FIGURE 4, a generally conventional computer 401 is
illustrated, which is suitable for use in connection with practicing the present
invention. Alternatively, a workstation, laptop, distributed systems environment, or
other type of computational system may instead be used. Computer 401 includes a 
processor chassis 402 in which are optionally mounted a floppy or other removable 
media disk drive 403, a hard drive 404, a motherboard populated with appropriate 
integrated circuits (not shown), and a power supply (also not shown). A monitor 405 
(or other display device) is included for displaying graphics and text generated by 
software programs, and more specifically, for alarm levels of the present invention. A 
mouse 406 (or other pointing device) is connected to a serial port (or to a bus port) on 
the rear of processor chassis 402, and signals from mouse 406 are conveyed to the 
motherboard to control a cursor on the display and to select text, menu options, and 
graphic components displayed on monitor 405 in response to software programs 
executing on the computer, including any program implementing the present 
invention. In addition, a keyboard 407 is coupled to the motherboard for entry of text
and commands that affect the running of software programs executing on the 
computer. 

Computer 401 also optionally includes a compact disk-read only memory 
(CD-ROM) drive 408 into which a CD-ROM disk may be inserted so that executable 
files and data on the disk can be read for transfer into the memory and/or into storage 
on hard drive 404 of computer 401. Other types of data storage devices (not shown),
such as a DVD drive or other optical/magnetic media drive, may be included in
addition to, or as an alternative to, the CD-ROM drive. Computer 401 is preferably
coupled to a local area and/or wide area network and is one of a plurality of such 
computers on the network. 

Although details relating to all of the components mounted on the 
motherboard or otherwise installed inside processor chassis 402 are not illustrated, 
FIGURE 5 is a block diagram illustrating some of the functional components that are 
included. The motherboard includes a data bus 501 to which these functional 
components are electrically connected. A display interface 502, comprising a video 
card, for example, generates signals in response to instructions executed by a central 
processing unit (CPU) 508 that are transmitted to monitor 405 so that graphics and 
text are displayed on the monitor. A hard drive/floppy drive interface 504 is coupled 
to data bus 501 to enable bi-directional flow of data and instructions between data
bus 501 and floppy drive 403 or hard drive 404. Software programs executed by
CPU 508 are typically stored on either hard drive 404, or on a CD-ROM, DVD, other 
optical/magnetic high capacity storage media, or a floppy disk (not shown). 
Alternatively, the programs may be stored on a server, e.g., if the computer comprises 
a workstation. A software program including machine language instructions that 
cause the CPU to implement the present invention will likely be distributed either on 
conventional magnetic storage media, on-line via, or on a CD-ROM or other 
optical/magnetic media. 

A serial/mouse port 503 (representative of the two serial ports typically 
provided) is also bi-directionally coupled to data bus 501, enabling signals developed 
by mouse 406 to be conveyed through the data bus to CPU 508. A CD-ROM
interface 509 connects CD-ROM drive 408 (or other storage device) to data bus 501.
The CD-ROM interface may be a small computer systems interface (SCSI) type 
interface or other interface appropriate for connection to and operation of CD-ROM 
drive 408. 

A keyboard interface 505 receives signals from keyboard 407, coupling the 
signals to data bus 501 for transmission to CPU 508. Optionally coupled to data 
bus 501 is a network interface 506 (which may comprise, for example, an Ethernet™ 
card for coupling the computer to a local area and/or wide area network). Thus, data 
used in connection with the present invention may optionally be stored on a remote 
server and transferred to computer 401 over the network to implement the present 
invention. 

When a software program is executed by CPU 508, the machine instructions 
comprising the program that are stored on removable media, such as a CD-ROM, a 
server (not shown), or on hard drive 404 are transferred into a memory 507 via data 
bus 501. Machine instructions comprising the software program are executed by
CPU 508, causing it to implement functions as described above while running a
computer program. Memory 507 includes both a nonvolatile read only memory 
(ROM) in which machine instructions used for booting up computer 401 are stored, 
and a random access memory (RAM) in which machine instructions and data are 
temporarily stored when executing application programs, such as a database program 
using the present invention.

Conclusion

Although the present invention has been described in connection with the 
preferred form of practicing it and modifications to that preferred form, those of 
ordinary skill in the art will understand that many other modifications can be made 
thereto within the scope of the claims that follow. Accordingly, it is not intended that
the scope of the invention in any way be limited by the above description, but instead
be determined by reference to the claims that follow.